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Abstract 

Background: Next-generation DNA sequencing platfornns are capable of generating millions of reads in a matter of 
days at rapidly reducing costs. Despite its proliferation and technological improvements, the performance of 
next-generation sequencing remains adversely affected by the imperfections in the underlying biochemical and 
signal acquisition procedures. To this end, various techniques, including statistical methods, are used to improve read 
lengths and accuracy of these systems. Development of high performing base calling algorithms that are 
computationally efficient and scalable is an ongoing challenge. 

Results: We develop model-based statistical methods for fast and accurate base calling in lllumina's next-generation 
sequencing platforms. In particular, we propose a computationally tractable parametric model which enables 
dynamic programming formulation of the base calling problem. Forward-backward and soft-output Viterbi 
algorithms are developed, and their performance and complexity are investigated and compared with the existing 
state-of-the-art base calling methods for this platform. A C code implementation of our algorithm named Softy can be 
downloaded from https://sourceforge.net/projects/dynamicprog. 

Conclusion: We demonstrate high accuracy and speed of the proposed methods on reads obtained using lllumina's 
Genome Analyzer II and HiSeq2000. In addition to performing reliable and fast base calling, the developed algorithms 
enable incorporation of prior knowledge which can be utilized for parameter estimation and is potentially beneficial 
in various downstream applications. 



Background 

Technological advancements in sequencing technologies 
have enabled fast sequencing of millions of reads at a 
rapidly reducing cost. Sequencing-by-synthesis, includ- 
ing Illuminas platforms based on reversible terminator 
chemistry and 454s pyrosequencing platforms, is taking 
us closer to affordable routine sequencing tasks. However, 
inherent uncertainties in the ensemble-based sequencing- 
by-synthesis, along with data acquisition noise, present a 
major bottleneck in the quest for reads that are as long 
and reliable as those provided by the conventional Sanger 
sequencing method. 

Sequencing-by-synthesis on Illuminas platforms typi- 
cally involves the following steps. First, multiple copies 
of the genome being sequenced are broken into short 
fragments in a random fashion, followed by ligation of 
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sequencing adapters to the fragments. In the next phase, 
the DNA sample is introduced into one of the 8 lanes 
(or 16 flow cells for HiSeq) split into multiple tiles each 
containing a lawn of primers covalently bonded to the sur- 
face to generate clonal clusters of the captured forward 
and reverse DNA strands. In particular, captured strands 
hybridize to neighboring primers to form so-called U- 
shaped bridges, followed by the process of bridge ampli- 
fication which is repeated ^ 35 times to generate clusters 
containing ^ 2000 molecules. In the final stage, "sequenc- 
ing", fluorescently labeled reversible terminators (different 
color for each base) are introduced and incorporated into 
the complementary strands of the DNA templates. Imag- 
ing of the fluorescently labeled clusters is followed by 
cleavage and unblocking of the incorporated nucleotides, 
and the labeled reversible terminators are added anew to 
proceed with synthesis of the complementary strands. 

Images acquired at the end of each sequencing cycle 
are processed, and the resulting signals are analyzed to 



O© 201 3 Das and Vikalo; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative 
BIoIVIGCI Ccntrsl commons Attribution License (http://creativecommons.Org/licenses/by/2.0), which permits unrestricted use, distribution, and 
reproduction in any medium, provided the original work is properly cited. 



Das and Vikalo BMCBioinformatics 201 3, 14:1 29 
http://www.biomedcentral.eom/1 471 -2 1 05/1 4/1 29 



Page 2 of 10 



determine the type of nucleotides incorporated into the 
complementary strands. The problem of inferring the 
order of nucleotides in a template from the noisy signals is 
referred to as base calling. For base calling, lUumina uses 
its in-house software Bustard. Fundamentally, Bustard 
attempts to reverse the effects of various uncertainties on 
the signal and then makes base calls. While it is computa- 
tionally very fast, the error rates resulting from Bustard s 
calls may be significantly improved by more sophisticated 
algorithms [1]. 

Several base calling strategies for the Illumina platform 
have been proposed in recent years such as [2-4]. In [5], 
a parametric model of the sequencing-by-synthesis pro- 
cess was proposed, and a Monte Carlo method for base 
calling was presented. While having better performance 
than Bustard, this scheme is computationally intensive 
and impractical for processing tens of millions of reads 
generated by today's sequencers. Kao and Song 2010 [6], 
the follow-up paper, suggested a modification leading to 
a computationally feasible base calling albeit with slight 
degradation in performance compared to [5]. In [7], an 
algorithm achieving significant improvement in speed at 
the cost of a minor deterioration of base calling error rate 
as compared to [5] was presented. 

Following a simplification of the parametric model of 
lUuminas sequencing-by-synthesis platform proposed in 
[5], in this paper we present a formulation of the base 
calling problem amenable to being solved by dynamic pro- 
gramming methods. In particular, we derive the forward- 
backward and soft-output Viterbi algorithm (SOVA) for 
solving the base calling problem. The performance of 
the proposed algorithms is demonstrated on experimental 
reads acquired from Illumina s Genome Analyzer II and 
HiSeq2000 and compared with several recent base calling 
techniques. 

Methods 

In this section, we present the mathematical model that 
leads to the dynamic programming formulation of the 
base calling problem, and present algorithms for base 
calling and parameter estimation. 

Comprehensive model of sequencing-by-synthesis in 
lllumina's platforms 

A detailed mathematical model of lUumina's platforms 
which describes various sources of uncertainty in the 
sequencing process and data acquisition step was intro- 
duced in [5]. For the sake of self-contained presentation, 
we briefly review this model here while omitting details 
for brevity. 

The 4-dimensional observation vector Y/ (inten- 
sity) acquired in the cycle (/ = 1, 2, . . . , A/^) of the 
sequencing-by-synthesis process can be expressed as 



Yi-^NiKXi.'Li) i=h 

YilYi.i^XiKXi + aYi^i.Hi) / = 2,....Ar, (1) 

where K is the 4x4 cross-talk matrix quantifying overlap 
of the emission spectra of the four fluorescent tags used 
to label nucleotide bases, a is the parameter accounting 
for empirically observed signal leakage between consecu- 
tive cycles, Xi is the signal generated in the i^^ cycle, and 
5:,- = WXiWjj: is the variance describing multiplicative 
measurement noise that primarily originates from fluctu- 
ations in K. The generated signal, Xi, is affected by phasing 
and pre-phasing. In an ideal setting, addition of the four 
terminating base nucleotides during the sequencing step 
should lead to a single base getting incorporated into 
each of the complementary strands. However, nucleotide 
incorporation is not perfect and phasing (when no base is 
incorporated) or pre-phasing (when more than 1 base is 
incorporated) may occur. These effects are modeled prob- 
abilistically: it is assumed that no base is incorporated 
with probability pu, while with probability pcj more than 1 
base is incorporated. For the sake of tractability of the final 
model and practical feasibility of base calling, we assume 
that at most two bases may be incorporated into a comple- 
mentary strand in a single cycle. Define an (L + 1) x (L-\-l) 
transition matrix P with entries 

Pii if j = h 

_ I i\-Pii)i\-Pcf) if ; = / + !, 

PcfO^-Pii) if ; = / + 2, 

0 otherwise, 

where is the probability of a complementary strand 
extending from length / to length ; in any given cycle. Then 
the signal generated in the cycle, Xi, can be expressed 
as 

Xi = XiZi = AKSE^),-, 

where S is a 4 x L matrix (The i^^ column of S, 5/, has 
all zero entries except for one indicating the base in the 
i^^ position of the template. We follow the convention 
where the first component corresponds to A, the second 
to C, the third to G and the fourth to T) representing 
the template sequence of length Z, E is an x Z matrix 
having entries £y =[P^]o,y equal to the probability that 
the length of a strand after / cycles is and ki is a ran- 
dom variable describing empirically observed signal decay 
caused by the DNA loss due to primer-template melting, 
digestion by enzymatic impurities, DNA dissociation and 
misincorporation, 

ki I Xi-i - - d)Xi-i, (1 - dfX^_^a^), (3) 

where d is s. constant droop factor over all cycles and all 
reads and a is the standard deviation of A/. 
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lllumina's base calling approach 

Prior to base calling, Bustard (lUuminas base calling soft- 
ware) estimates cross-talk using signals generated by syn- 
thesizing the first 2 bases of all reads, evaluating entries 
of K as the median of the estimates obtained using indi- 
vidual read signals. Bustard then infers X by inverting K 
and multiplying it with Y. Next, it calculates a tile-wide 
average scalar Xi = ^Xj^i and renormalizes the signal 
by multiplying Xi by Xi/Xi. This procedure corrects for 
the signal droop. Matrix E is estimated from the first 12 
bases, inverted and multiplied by the normalized Xi val- 
ues. This compensates for phasing/prephasing. Finally, for 
each cycle, base calling is performed by selecting the base 
inferred as having the highest corrected signal. 

Parameter estimation and basecalling approach ofBayesCall 
and naiveBayesCall 

BayesCall relies on the comprehensive model reviewed 
in this section to perform base calling and significantly 
reduce error rates compared to Bustard. However, it suf- 
fers from two major computational bottlenecks. First, the 
lack of a closed form expression for the solution to the 
E-step of the EM algorithm used for parameter estima- 
tion necessitates a computationally intensive numerical 
optimization. Hence, the parameter estimation stage is 
time consuming, requiring ^ 25 minutes per iteration 
on an 8 core machine. Consequently, BayesCall performs 
a single parameter estimation step that uses reads from 
all the tiles in a lane to generate a single set of param- 
eters for the entire lane. Detailed analysis of BayesCall 
and naiveBayesCall error rates indicates that using a sin- 
gle set of parameters for an entire lane results in serious 
performance degradation for tiles where the parameters 
significantly differ from the ones used by the base call- 
ing algorithms (data not shown). Moreover, base calling in 
BayesCall is performed by relying on simulated annealing. 
Being a computationally intensive algorithm, the times 
for base calling via simulated annealing are prohibitively 
high. In order to overcome this issue, in the follow-up 
paper [6], the authors propose a simplified heuristic which 
reduces base calling times to ^ 6 hours per lane with a 
small reduction in performance compared to BayesCall. 
However, since the parameter estimation step used by 
naiveBayesCall is the same as the one used by BayesCall, 
tiles with parameters which significantly differ from the 
single parameter set computed for the entire lane have 
very small performance improvements over Bustard. 

Our model refinements for fast tractable base calling 

While providing detailed description of various sources of 
uncertainty, mathematical model of the sequencing pro- 
cess overviewed in the previous section leads to compu- 
tationally demanding base calling algorithms. To simplify 
the model and enable practically feasible base calling, we 



approximate X/ by its mean. Such an approximation is jus- 
tified by the analysis of experimental data which shows 
that the coefficient of variation (ratio of the standard devi- 
ation to the mean) of X/ in (3) is small (typically below 
0.1 for early cycles and below 0.06 in the latter ones) [7]. 
On the other hand, it is desirable that the model allows 
variations in the droop factor from one cycle to another. 
Therefore, we describe the decay as Xi = k YVj=2^^ ~ ^/)' 
where A denotes a read-dependent transduction coeffi- 
cient mapping synthesis events to the generated signal 
intensity, and dj denotes cycle-dependent droop factors. 

Note that the signal generated in the i^^ cycle of the 
sequencing step, (SE^)/, can be expressed as 

L L 

where fii^js are dependent on pa and pcf. Based on the 
initial parameter estimates obtained using Monte Carlo 
methods, we observe that pa is very small. It is also 
observed that if we choose to retain only those terms in a 
given row of E that are at least 10% of the maximum entry, 
there is at most one base ahead of the tested base that 
contributes significantly to the signal Consequently, we 
may approximate (SE^)i in (4) as 

(SE^),- ^ (1 - + (5) 

where fii,i-\-i is a cycle-dependent parameter which allows 
us to essentially approximate the nonlinear dependence of 
E on pcf and hence facilitate efficient base calling. 

For any given cycle the intensity of the signal in (5) is 
a function of Si and 5/+!. Such a finite memory approxi- 
mation enables search for the optimal path ^i, ^2, . . . , <Sl 
using dynamic programming principles. Graphically, this 
can be interpreted as the search on a 16-state trellis (The 
number of states needs to be increased if the parame- 
ters Pa and/or pcf are large, or if longer reads need to be 
called), where the states at the i^^ stage of the trellis rep- 
resent all possible pairs of bases in the i^^ and (i + lY^ 
position of a read. We denote the states of the trellis by T/, 
where / is the cycle number. The states can take one of 16 
possible values in the set AC, AG, . . . , TT}, I < j < 
16. Note that not all state transitions are feasible. In par- 
ticular, a transition from a state in cycle / to a state in cycle 
/+1 is valid if the second symbol of the state in cycle / is the 
same as the first symbol of the state in cycle / + 1. Figure 1 
illustrates two consecutive stages of the trellis. For the 
sake of tract ability of the illustration, only 8 of the possi- 
ble 16 states are shown. Arrows indicate valid transitions 
between the states that are included in the illustration. 
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Figure 1 The 1 6 state trellis illustration of the transitions 
between states in the i^^ and (i + 1)*^ stage of the trellis. The 

figure shows 8 out of the possible 16 states along with all valid 
transitions between them. 



Final model 

Based on the discussion in the preceding section, the final 
model (for the /^^ window) is of the form 

Yi^^iXKiXi^k^WXiWllli) 1 = 1 

i 

;=2 (6) 
{k]\{l-~dj)f\\Xi\\lYi) i = %„„N 

7=2 



where X/ = and is relabeled as ft 

for the simplicity of notation. Let us collect all the param- 
eters into a vector 0. Given 0, k and ^i, . . . , Yn^ the 
goal of base calling is to determine 5'i,5'2, . . . In our 
approach, we first obtain estimates of the parameters 0 
using an unsupervised learning scheme. Then, posterior 
probabilities of Si are determined using either forward- 
backward or soft-output Viterbi algorithm. Details of the 
proposed scheme follow. 

Parameter estimation 

We infer parameters of the mathematical model (6) by 
relying on an unsupervised estimation scheme. Unsuper- 
vised estimation need not be aided by a reference genome 
nor does it require analyzing a known sequence in a con- 
trol lane. Our scheme also has the advantage of being 
implemented as an online, as opposed to a batch, algo- 
rithm. This allows parameter estimation (and base calling) 
of a previous window to be performed while the exper- 
iment is still in progress, resulting in smaller latency 
between the end of the run and basecalling results. In par- 
ticular, we employ the online expectation-maximization 
(EM) algorithm [8] which relies on a training set oi R = 
250 reads randomly selected across a tile. The optimiza- 
tion problem that the EM algorithm solves in an iterative 
fashion can be stated as 

0^ = argm^xE(;,,s)|0,_JlogP(S,A|Y,0)], (7) 

where the scalar coefficient k and the template sequence 
matrix S are latent variables, 0 = {a, ^, K, E, ^y} is 
the set of parameters which need to be determined, and 
logP(S, A.|Y, 0) denotes the log-posterior function. In the 
absence of any prior information, this is also the log- 
likelihood function. The expectation in (7) needs to be 
evaluated with respect to k and S given the current esti- 
mates of 0. 

The results from [5] indicate that base calling may be 
significantly improved by allowing the parameters to be 
cycle dependent. We observe the same and thus divide a 
sequencing run into windows of length = 6, and esti- 
mate model parameters window-by-window. Parameters 
for window / are initialized using the values of the param- 
eters estimated in the previous window, / — 1. To prevent 
over-fitting, (7) is optimized over two windows, / and /+ 1, 
and the resulting 0 is used as the set of parameters for 
window /. A window length W = 6 was found to be short 
enough to capture time variations in the parameters and 
still maintain run times of the parameter estimation and 
base calling low. 

Initialization for the first window 

The EM algorithm requires initialization of the parame- 
ters in the first time window. We can reliably call the first 
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two bases in a template by simply identifying the chan- 
nels having the largest signal in the first two test cycles 
from which an initial estimate of K is obtained. As done 
by Bustard, multiple estimates of the columns of K can 
be computed from the first two signals of each read in 
a tile, and then the median of all these estimates can be 
used to provide an initial estimate K. Subsequently, we 
find the mean of each column and add this to the diagonal 
entries. We then use the inverse of the resulting matrix to 
call bases again and iteratively refine the estimates of the 
entries of the matrix. The number of iterations is set to 5. 

Given K, an empirical estimate of E is obtained by com- 
puting the difference between the intensity vector and the 
column of the cross talk matrix that corresponds to the 
called base. The covariance matrix is computed from 
this for each read. Finally, the estimate E is formed as the 
median of E^. Parameters ai and are negligible in the 
early cycles and are initialized as zeros. To estimate droop 
coefficients di, we start by calculating for each read 

and summing up the resulting vector elements to obtain 
the total signal x\ acquired in the i^^ cycle. The droop for 

the i^^ cycle is then calculated as d^- = x\/x^-_-^ for each 

individual read. Finally, the median value of all d^- is cho- 
sen as the initial value of di. Details of this step are omitted 
and the interested reader is referred to [7]. 

E-step for the first window 

The E-step requires finding the expectation of the log- 
likelihood function in (7) over k and S. Closed form 
expressions are not available, while the numerical Monte- 
Carlo methods are computationally prohibitive in prac- 
tice. As an alternative, we rely on Bustard s approach to 
call sequences in the training set and use the resulting 
S^y I < k < Ry to approximate the expectation with 
respect to S. In particular, we approximate the objective of 
maximization (7) by 



0 = -^E,. 



il+l)W 



i=(l-l)W+l 



(8) 



where 



;=2 

(Y^-X^ ri a-di)KiXffj:-\Y^ - A* n (1 - dj)KiXf) 

;=2 /=2 



{x'<Y\a-~di))HXip 
;=2 



(9) 



and Yi = Yi for i = 1 and Yi = Yi — otYi-i for />!,/<= 
N. The superscript k is an index of a read in the training 
set and ranges from 1 to R. Then the expectation over 



in (8) is evaluated numerically via importance sampling, 
leading to an approximation of the objective function 



O 



k=li={l-l)W+l ;=1 



1 * 



Nis 

^lV;,^L(A;,Sf,0/), 



(10) 



where wjj^ denote normalized weights of Nis = 500 sam- 
ples kj generated for each read in the training set from the 

Gaussian distribution N(X^y0.l) (such a choice of sam- 
pling distribution for is suggested by the analysis of 
experimental data). The mean of the sampling distribution 
for each read in the training set, A^, is obtained by max- 
imizing the log-likelihood function (9) given the current 
estimates of the parameters 0 and base calls S^. 

M-step for first window 

The objective function in (9) is separately differentiable 
and convex over each of the parameters in 0 except p. 
To optimize it, we rely on a cyclic co-ordinate descent 
scheme which rotates among the components of 0. To 
find fiy we employ a grid search. The co-ordinate descent 
is terminated when the ratio of the change in the value of 
the objective function to the value of the objective func- 
tion in a previous iteration is less than e = 0.003. We 
use a similar stopping criterion for termination of the 
expectation-maximization algorithm. 

E-step for subsequent windows 

Due to phasing effects and other imperfections affect- 
ing generated and measured signal, using Bustards calls 
to approximate expectation of the log-likelihood func- 
tion as in (8) fails to provide reliable parameter esti- 
mates in subsequent windows. On the other hand, 
numerical evaluation of the objective function in (7), 
IE(A.,S)|0„_i [logIP(S, A|Y, already argued in this 

section, is computationally prohibitive in practice. To 
facilitate practically feasible evaluation of the E-step for 
windows I > 1, for each read in the training set we approx- 
imate the transduction coefficient k^ by its mean, A^, and 
replace the objective function by 

R M 

^ ^P(5dY^ 0) log(P(5,-|Y^ 0)), (11) 

k=l i=l 

where A,'"^ is obtained by maximizing the log-likelihood 
function (9) given the parameters inferred in the (/ — 1)^^ 
window. Posteriori probabilities F(Si\Y^y A,^, 0) needed to 
evaluate expression (11) can be found from the state pos- 
teriori probabilities. For instance, posteriori probability 
that the i^^ base is A is 

4 

P(Sf =[1 000] |Y^ A^ 0) = ^P(r/ = ^;|Y, A^ 0), 

(12) 
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whereby g {AA,AC,AG,AT,CA, . . . ,TT}, 1 < j < 16. 
Clearly, we need to find P(r/ = tj\Y,X'^, 0). For this, we 
turn to dynamic programming ideas - in particular, the 
forward-backward and soft-output Viterbi algorithms. 

Forward-backward algorithm 

Denote the transition probability from state k in the i^^ 
stage to state / in the (i + 1)^^ stage of the trellis by aj^i = 
F(Ti = tj^\Ti-^i = ti). If no prior information about transi- 
tion probabilities is available, we will assume that the valid 
transitions are equally likely. Moreover, note that the state 
priors may be computed from the symbol priors, if those 
are available. For instance, prior for the state 7/ = AC can 
be found as the product of the priors ionSj =[1 0 0 0] 
and Sj_^^=[OlOOl Let //(/) = PCn, ^2, • • • , Ti = ti) 
denote the so-called forward probabilities, and bi{i) = 
P(y^+i, . . .^YmITi = ti) denote the backward probabil- 
ities. Moreover, let eiiXi) = P(Y/|r/ = ^/,A., 0) denote 
emission probabilities. Then the recursion that computes 
forward probabilities can be stated as 

16 

//(/ + 1) = eiiYi^i) ^fk{i)akb 
k=i 

while the backward recursion is given by 

16 

h(i) = ^ei(Yi^i)af,ibi(i + 1), 
1=1 

The recursions are initialized by setting /o (0) = 1 and 
b/((M) = Uk^e, where a]<^^e denotes the probabilities of the 
terminating state as computed by the forward algorithm. 
Finally, the posterior probability is obtained as 

fk(i)bk(i) 

J2j=iMi)bj(i) 

for all 1 < < 16, 1 < / < M. In order to ensure that the 
finite size of the trellis does not adversely effect reliability 
of the computed probabilities, we add an extra 5 cycles in 
the calculations (i.e., we use l^i, . . . , Im+s)- 

Soft output Viterbi algorithm 

The forward-backward algorithm computes exact pos- 
teriori probabilities of the bases in a sequence. On the 
other hand, one can rely on various heuristics to obtain 
reasonably good approximations of posteriori probabili- 
ties while suffering only minor degradation in accuracy. 
Such heuristics include the soft-output Viterbi algorithm 
(SOYA), a modification of the Viterbi algorithm imple- 
mented on the same trellis we described in previous 
sections. 

Let Vk(i) denote the probability of the most likely state 
sequence which ends at 7/ = t^, i.e., 

Vk(i)= max P(ri,...,r,-,Ti,...,r,_i,r, = /r). 



Retaining the notation introduced for the description of 
the forward-backward algorithm, we can recursively com- 
pute v^(0 as 

vi(i +l) = ei(Yi^i) rmxakiVk(i), 

k 

where the recursion is initialized by setting vo(0) = 1, 
Vy^(O) = 0 for all k > 0. This recursion is at the core of 
the Viterbi algorithm, which then proceeds by backtrack- 
ing through the optimal trellis path to determine the most 
likely sequence of states. The Viterbi algorithm, however, 
provides only the most likely sequence of states and does 
not find posteriori probabilities of the symbols. To this 
end, a soft-output variant of the Viterbi algorithm was 
proposed in [9]. SOVA traces back optimal path through 
the trellis and for each symbol (i.e., base) explores alter- 
native paths that could have changed the decision of the 
Viterbi algorithm for that symbol Cost metrics of the 
alternative paths are then used to approximate posteriori 
probabilities for the base under consideration. To allow 
computationally efficient procedure, we limit the length 
of deviation of the alternative paths from the optimal one 
to 3 edges. Note that it is necessary to normalize the 
posterior probabilities obtained in the described fashion. 
As we will demonstrate in the subsequent sections, the 
forward-backward algorithm achieves better base calling 
error rates than SOVA, but it does so at the cost of having 
reduced speed. 

M-step for subsequent windows 

The M-step for subsequent windows is very similar to the 
M-step for the first window. The only difference is that the 
objective function being maximized is now 

^ R (l-^l)W 4 

- 2 J2 nSi=sj\Y^ IK 0/)L(Ay, 5,-5^ 0/). 

k=l i={l-l)W+l j=i 

(13) 

The optimization follows the same procedure as described 
for the first window. 

Updating - After each step of the EM algorithm used 
for estimating parameters in a given window, we make 
calls for (using outputs of either forward-backward or 
SOVA). The calls and the most recent parameters are then 
employed to update by maximizing the log-likelihood 
function (9). The updated value of X^ is used by the EM 
algorithm in the next window. 

Base calling 

Given 0 inferred by the EM algorithm and Fi, • • • > 
the goal of base calling is to determine 5i,52, . . .ySiy i.e., 
to find 

Si = argmaxP(5/ = sj\Y,X, 0), (14) 
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where sj can take values of unit vectors comprising three 
zeros and one non-zero entry equal to 1, and 1 < / < 
4. Base probabilities ¥(Si = 5/|Y,X, 0) can be calculated 
from the state probabilities of the trellis that we defined in 
the parameter estimation section, e.g., 



P(5f =[1 000]) = ^P(r,- = tj\Y,k, 0), 
7=1 



(15) 



and so on. Note that these probabilities are also the 'qual- 
ity score' assigned to the given basecall (more on quality 
scores in the next section). Clearly, we need to find pos- 
teriori probabilities P(J/ = ^/ |Y, X, 0). For this, we again 
turn to the soft-output Viterbi and forward-backward 
algorithms that we described in the previous section. 

Note that the value of A used for base calling in window 
/ is approximated by the value of k which maximizes the 
log-likelihood function formed using 0 and Si from the 
previous window, I — I (except in window 1=1 where 
we use Si provided by Bustard). It is straightforward to 
show that this maximization entails solving the quadratic 
equation in A 



iw 



^ (KiXif^rHYi) 



■X 



i=ii-i)w+i (0(1 -4)) KIP 

Y^r^Yi 



(16) 



= 0, 



(n(i-4))'ii^/ip 

and choosing the positive solution as the value of X. 
Quality scores 

Performance of various base calling algorithms can be 
compared by evaluating error rates that they achieve 
when applied to determining the order of nucleotides 
in a known sequence. In practical applications, where 
the sequence being analyzed is not known, we need to 
assess the confidence of a base calling procedure. To this 
end, quality scores provide information as to how reli- 
able the corresponding base calls are. The quality scores 
that we assign to base calls are the posterior probabilities 
of the bases computed by the forward-backward/SOVA 
schemes. In particular, we use the posteriori probabilities 
of the bases computed according to (15) as the qual- 
ity scores. In order to assess the 'goodness' of quaUty 
scores, we consider their discrimination ability [10,11]. 
The discrimination ability for a given error rate is obtained 
by sorting all bases according to their quality scores in 
descending order and finding the number of bases called 
before the error rate exceeds the predefined threshold. 



Results 

GAM 

Performance of the forward-backward algorithm and 
SOVA is verified on a full lane data obtained by sequenc- 
ing phiX174 ((EMBL/NCBI accession number J02482) 
bacteriophage using lUumina's Genome Analyzer II which 
generates reads of length 76. After basecalling the lane by 
Bustard, naiveBayesCall, Rolexa, Ibis, forward-backward 
and SOVA, the calls were mapped onto the known refer- 
ence sequence comprising 5386 bases. The optimal align- 
ment is found using a Hamming distance metric. Reads 
that map with less than 30% errors are retained while 
reads having more errors are removed to ensure that there 
is no ambiguity in the alignment. This results in approx- 
imately 7 million reads and 550 million bases which are 
used to compare the performance of the considered base- 
calling schemes. Average error rates computed over the 
entire lane are compared in Table 1. Figure 2 shows the by 
tile error rates, by cycle error rates and the discrimination 
abilities of the different basecallers. Forward-backward 
algorithm and SOVA outperform all other schemes in 
terms of error rates and discrimination abilities. 

HiSeq 

Performance of the forward-backward algorithm and 
SOVA is verified on reads from E.coli (EMBL/NCBI acces- 
sion number NC007779) using Illumina s HiSeq2000 com- 
prising of 100 cycle paired end data. The error rates 
for both pairs of reads are shown as a function of cycle 
number in Figure 3. Average error rates are compared in 
Table 2 for both SOVA and FB schemes. As can be seen, 
we improve on Bustard's calls by 12.3 and 9.6% for the first 
and second pair respectively. 

Discussion 

Computational complexity 

For each read, the most computationally expensive Bus- 
tard's step is its correction of phasing effects. For both 
forward-backward algorithm and SOVA, we need to 
evaluate 16 objective functions for the states at each stage 

Table 1 Comparison of error rates and speed for GAM 

Decoding strategy Error rate Running times 



FB 


0.0128 


400mins 


SOVA 


0.0129 


300nnins 


OnlineCall 


0.0137 


30mins 


naiveBayesCall 


0.0139 


ISOOmins 


Ibis 


0.0147 


480mins 


Bustard 


0.0154 


40nnins 


Rolexa 


0.0171 


720mins 



A comparison of error rates and running times (per lane) for different base 
callers (note that Bustard's running time is underestimated since it does not 
account for the parameter estimation step). 
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Figure 2 (a,b,c) - Comparison of the different basecalling strategies under different performance metrics. The figure shows a) Base calling 
error rates as a function of tile, b) Base calling error rates as a function of cycle number, c) Discrimination ability. 



of the trellis. In order to avoid finite window effects, 
for each window of length 6 additional 5 cycles are 
included in the computations. Therefore, for a 76 cycle 
read, we need to evaluate 131 x 16 state values. Addi- 
tional overhead due to combining these values requires 
mostly additions (when the algorithms are implemented 
in the log domain). naiveBayesCall on the other hand, 
performs matrix inversion of the same complexity as those 
performed by Bustard, followed by evaluation of 4 x 76 x 



21 terms. The factor 21 arises due to the fact that naive- 
BayesCall needs to solve a quartic equation using a golden 
section search that requires 21 evaluations per base. Thus, 
forward backward and SOVA are ^ 3 times faster than 
naiveBayesCall. 

Implementation and running times 

We implemented our codes on an Intel 17 machine 
@3.07GHz using only a single core. With our codes 
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Figure 3 (a,b) - Comparison of the different basecalling strategies for HiSeq2000. The figure shows a) Base calling error rates as a function of 
cycle for the first pair, b) Base calling error rates as a function of cycle for the second pair. 



written in C, it takes approximately 240 seconds to read in 
an intensity file, perform the parameter estimation step on 
250 reads, call bases for the whole tile and write it in fastq 
format for the forward backward scheme and 180 seconds 
for SOYA. Processing an entire lane requires about 400 
minutes and 300 minutes for FB and SOYA, respectively. 
naiveBayesCall, on the other hand, requires 19 hours just 
for its parameter estimation step while its basecalling 
takes 6 hours. Thus, our FB and SOYA implementations 

Table 2 Comparison of error rates for HiSeq 



Decoding strategy 


Error rate (Pair 1) 


Error rate (Pair 2) 


FB 


0.0029 


0.0029 


SOVA 


0.0029 


0.0029 


Bustard 


0.0033 


0.0032 



are 4 and 5 times faster than naiveBayesCall. Note that 
the run times of naiveBayesCall are reported for an imple- 
mentation on a processor with 8 cores; it is expected 
that a parallel implementation of our algorithm would 
reduce the total running time by roughly 8 times. In 



Table 3 Comparison of error rates for supervised and 
unsupervised schemes 

Error rate 

0.0125 
0.0127 
0.0124 
0.0126 



A comparison of error rates for different base callers for HiSeq. 



Decoding strategy 

FB (unsupervised) 

SOVA (unsupervised) 

FB (supervised) 

SOVA (supervised) 

A comparison of error rates for supervised and unsupervised schemes on a 
single tile for GAII. 
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addition, our proposed schemes would be able to almost 
instantaneously provide very high quality base calls to the 
end user since they are online (as opposed to batch) in 
nature. A comparison of the running times for processing 
an entire lane between the forward-backward algorithm 
and SOYA and the other basecallers is shown in Table 1. 

Improving error rates using supervised parameter 
estimation 

Although the described parameter estimation procedure 
assumes no supervision, the proposed forward-backward 
and SOYA schemes allow incorporation of non-uniform 
priors that may improve accuracy of the inferred param- 
eters and hence the overall base calling performance. 
Illumina platforms typically have a dedicated control lane 
comprising reads from a known reference. In such a case, 
it is possible to obtain priors by aligning the reads onto the 
reference and using them to improve the accuracy of the 
estimated parameters. 

To this end, we utilize the calls from Bustard and align 
the reads onto the reference phiXll^ genome using the 
same mapping criteria as described in the Results section. 
A basecall that is perfectly mapped to the reference is 
assigned a prior probability of 1, while in case of a mis- 
match the prior probabilities are split between the base 
suggested by the reference and the base called. If the refer- 
ence is not very trustworthy, lower prior can be assigned 
to the base implicated by the reference. The described 
change requires very minor modification of the parameter 
estimation step. Table 3 shows the improvement obtained 
using the supervised scheme. Both forward-backward and 
SOYA schemes benefit marginally if the parameters are 
estimated in the supervised setting. 

Conclusion 

We presented a formulation of the base calling prob- 
lem on Illumina platforms that is amenable to being 
solved by dynamic programming methods, and proposed 
forward-backward and soft-output Yiterbi algorithms for 
solving it. Base calling error rate performance of the 
proposed algorithms was demonstrated on experimen- 
tal data to be superior to lUuminas Bustard and several 
other publicly available base callers. The developed base 
callers are tested on data obtained by Genome Analyzer II 
and HiSeq2000 but the model, concepts, and algorithms 
should apply to other Illuminas platforms as well. The 
developed schemes are online (as opposed to batch), scal- 
able, and much faster than competing model-based base 
callers. In addition, they are capable of accounting for soft 
inputs (priors) and generating soft outputs (posteriors) - 
a feature we exploited to devise a supervised scheme for 
learning parameters of the sequencing model, and may 
further be useful in applications where prior knowledge 
about reads is available. 
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